Generating hypotheses in data sets

ABSTRACT

A method for generating hypotheses in a corpus of data comprises selecting a form of ontology; coding the corpus of data based on the form of the ontology; generating an ontology space based on coding results and the ontology; transforming the ontology space into a hypothesis space by grouping hypotheses; weighing hypotheses included in the hypothesis space; and applying a science-based optimization algorithm configured to model a science-based treatment of the weighted hypotheses.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. patent application Ser. No. 14/597,652, filed Jan. 15, 2015, which claims priority to U.S. Provisional Patent Application No. 61/927,532, filed Jan. 15, 2014, both of which are incorporated by reference herein in their entirety.

FIELD

The present disclosure relates generally to a system, method, and storage medium for data analysis.

BACKGROUND

The total amount of digital information on global networks is increasing exponentially. The service and information provided by the networks have grown from just emails to discussion forums, instant messaging, news reporting, content sharing, social networking, online shopping, publication, library, opinion polling, cloud services, and so on. A human is not capable of reviewing every piece of information from this sea of information and quickly identifying all relevant data for a given subject or project. Technologies employing artificial intelligence have been developed to ease the data mining and reviewing tasks to assist human users and digital agents to quickly discover relevant data from large data sets, knowledge stores, and associated computer networks.

The demand for processing large amounts of digital data in real time is particularly heightened in the area of national security. Agencies faced with on-going digital and physical threats from various parts of the world are tasked with warning communities before an attack, implementing emergency preparedness, securing borders and transportation arteries, protecting critical infrastructure and key assets, and defending against catastrophic terrorism. What is most critical to achieving these tasks is an agency's capability to detect potential attacks early on and monitor such plots continuously before they are carried out. The data on global networks can potentially give an information seeking organization all the information it needs. However, the key question is how to effectively and carefully sort and search vast amounts of data.

Similar demands also exist in other surveillance areas including public health, public opinion, consumer products, and morale.

Current practices in identifying information of interest from a large amount of data include the use of keyword searches to look for specific information, the use of Bayesian classifiers to divide information, and the use of logistic regression to look for risk factors of predefined or desired outcomes. These practices, by their nature, however, cannot identify surprises, latest developments, or novel plots because these searches rely on a human conceived and defined set of interests or knowledge that a computer-aided search treats as a priori knowledge. This pre-set boundary limits the capability of a search to detect and identify unexpected events.

SUMMARY

To overcome the issues associated with current hypothesis generation techniques, the present disclosure presents a computer implemented method that allows the data itself to define a space of possible hypotheses, which optionally merges and groups similar hypotheses, and then weights and selects a subset of relevant hypotheses for further consideration by a human analyst. The computerized method uses a theoretical and physical basis to implement hypothesis generation. Specifically, a simulated annealing technique is applied and provides an understood, validated theoretical construct by which the problem of hypothesis generation can be solved. A weighing algorithm is applied that expresses the goal as an optimization problem. Moreover, this end-to-end approach is easily communicated due to the physics-based analogue, which is applicable to textual, audio, and video data, executable in real time or near-real time, and scalable to realistic applications. The method is also domain agnostic; namely, the method is generalized and interoperable among various systems or domains.

According to some embodiments, disclosed is a method for generating hypotheses in a corpus of data. The method comprises selecting a form of ontology configured as one or more ontology vectors; coding the corpus of data based on the form of the ontology vector; generating an ontology space based on coding results and the ontology form; transforming the ontology space into a hypothesis space by grouping hypotheses; weighing the hypotheses included in the hypothesis space; and applying a random-walk process configured to model a physics-based treatment process to the weighing results of the hypotheses.

According to some embodiments, the random-walk process is guided to explore hypotheses less likely to be anticipated. That is, hypotheses that are anticipated with a greater degree of expectation are discarded in favor of exploring unanticipated hypotheses. In other words, the random walk favors, without loss of generality, nonintuitive and nonconventional hypotheses within the hypothesis space that potentially, but not necessarily, have a low probability of occurrence.

According to another embodiment, the random-walk process is configured as a simulated annealing process.

According to yet another embodiment, the ontology space and the hypothesis space are fully computer-generated.

According to yet another embodiment, a hypothesis surface of the hypothesis space includes troughs whose depth indicates relevancy of a hypothesis neighborhood.

According to yet another embodiment, the method further comprises presenting a color map associated with the hypothesis space whose color brightness indicates the relevancy of a hypothesis neighborhood.

According to yet another embodiment, the method further comprises presenting an R-dimensional space representation projected onto a lower dimensional space, namely, an S-dimensional space where S<R.

According to yet another embodiment, the method further comprises identifying global minima as the relevant hypothesis or hypotheses.

According to yet another embodiment, the random-walk process is applied repeatedly with an increased hop-distance each time.

According to another embodiment, the random-walk process is configured as a genetic algorithm process.

BRIEF DESCRIPTION OF THE DRAWINGS

To the accomplishment of the foregoing and related ends, certain illustrative embodiments of the invention are described herein in connection with the following description and the annexed drawings. These embodiments are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present application is intended to include all such aspects and their equivalents. Other advantages, embodiments and novel features of the invention may become apparent from the following description of the present invention when considered in conjunction with the drawings. The following description, given by way of example, but not intended to limit the present invention solely to the specific embodiments described, may best be understood in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an embodiment of a network.

FIG. 2 illustrates an embodiment of a computer device.

FIG. 3 illustrates an embodiment of a corpus of information.

FIG. 4(a) illustrates an embodiment of ontology.

FIG. 4(b) illustrates another embodiment of ontology.

FIG. 5 illustrates an embodiment of a hypothesis generation method.

FIG. 6 illustrates an embodiment of ontology space.

FIG. 7 illustrates an embodiment of ontology space.

FIG. 8 illustrates an embodiment of a hypothesis space.

FIG. 9 illustrates an embodiment of a hypothesis space.

FIG. 10 illustrates an embodiment of a ranked hypothesis space.

FIG. 11 illustrates an embodiment of a color map of a hypothesis space.

FIG. 12 illustrates an embodiment of a hypothesis surface indicating weighted hypothesis space.

FIG. 13 illustrates an embodiment of a cycle of a simulated annealing process.

FIG. 14 illustrates an embodiment of a cycle that is rejected in a simulated annealing process.

FIG. 15 illustrates an embodiment of a cycle of a simulated annealing process.

FIG. 16 illustrates an embodiment of a result of a simulated annealing process.

FIG. 17 is a flowchart of a method for hypotheses generation that is optimized and filtered to bias towards a level of potential interest.

FIG. 18 is a table illustrating a hypothetical corpus of information that can be collected.

DETAILED DESCRIPTION

Those of ordinary skill in the art will realize that the description of the present application is illustrative only and not in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons, having the benefit of this disclosure. Reference will be made in detail to specific implementations of the present application as illustrated in the accompanying drawings.

Further, certain figures in this specification are flow charts illustrating methods and systems. It will be understood that each block of these flow charts, and combinations of blocks in these flow charts, may be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create structures for implementing the functions specified in the flow chart block or blocks. These computer program instructions may also be stored in a computer-readable memory or a storage medium that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory or storage medium produce an article of manufacture including instruction structures which implement the function specified in the flow chart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flow chart block or blocks.

Accordingly, blocks of the flow charts support combinations of structures for performing the specified functions and combinations of steps for performing the specified functions. It will also be understood that each block of the flow charts, and combinations of blocks in the flow charts, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.

For example, any number of computer programming languages, such as C, C++, C# (CSharp), Perl, Ada, Ruby, Python, Pascal, SmallTalk, FORTRAN, assembly language, and the like, may be used to implement aspects of the present application. Further, various programming approaches such as procedural, object-oriented or artificial intelligence techniques may be employed, depending on the requirements of each particular implementation. Compiler programs and/or virtual machine programs executed by computer systems generally translate higher level programming languages to generate sets of machine instructions that may be executed by one or more processors to perform a programmed function or set of functions.

The term “machine-readable medium” or “storage medium” can be understood to include any structure that participates in providing data which may be read by an element of a computer system. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM) and/or static random access memory (SRAM). Transmission media include cables, wires, and fibers, including the wires that comprise a system bus coupled to a processor. Common forms of machine-readable media include, for example, a floppy disk, a flexible disk, a hard disk, a magnetic tape, a compact flash card, a smart media card, an SMS card, any other magnetic medium, a CD-ROM, a DVD, or any other optical medium.

The term “ontology” can be understood to represent a formal conceptualization of a particular domain of interests or a definition of an abstract view of a world a user desires to present. Such conceptualization or abstraction is used to provide a complete or comprehensive description of events, interests, or preferences from the perspective of a user who tries to understand and analyze a body of information.

Each element comprising the ontology can be weighted to have a greater or lesser value in accordance with its significance. Default weights are assumed if unspecified.

The term “hypothesis” can be understood to represent a specific description or example extracted, according to the form of ontology, from a body of information, which is collected to find certain events, interests, or preferences. If ontology is deemed as genus, then a hypothesis may be deemed as species. The content described in a hypothesis may be true, potentially true, potentially false, or false, and may be relevant or unrelated to those events, interests, or preferences that are sought by a user. Thus, relevant hypotheses that may be of interest to a user need to be detected from all possible hypotheses generated from the body of information. Succinctly stated, a hypothesis makes a statement of a tentative explanation for an observation that can be tested by further investigation. Hypotheses may be true, potentially true, potentially false or false.

Each hypothesis can be assigned a rank. This rank can be computed either in a stateless manner or based on the path by which the hypothesis was discovered, namely based on its path history. Stateless evaluations consider only the current position of the exploration, whereas historical evaluations also consider the previous positions, namely states, that were traversed to reach the current state.

FIG. 1 depicts an exemplary networked environment 100 in which systems and methods, consistent with exemplary embodiments, may be implemented. As illustrated, networked environment 100 may include a content server 110, a receiver 120, and a network 130. The exemplary simplified number of content servers 110, receivers 120, and networks 130 illustrated in FIG. 1 can be modified as appropriate in a particular implementation. In practice, there may be additional content servers 110, receivers 120, and/or networks 130.

In certain embodiments, a receiver 120 may include any suitable form of multimedia playback device, including, without limitation, a computer, a gaming system, a cable or satellite television set-top box, a DVD player, a digital video recorder (DVR), or a digital audio/video stream receiver, decoder, and player. A receiver 120 may connect to network 130 via wired and/or wireless connections, and thereby communicate or become coupled with content server 110, either directly or indirectly. Alternatively, receiver 120 may be associated with content server 110 through any suitable tangible computer-readable media or data storage device (such as a disk drive, CD-ROM, DVD, or the like), data stream, file, or communication channel.

Network 130 may include one or more networks of any type, including a Public Land Mobile Network (PLMN), a telephone network (e.g., a Public Switched Telephone Network (PSTN) and/or a wireless network), a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), an Internet Protocol Multimedia Subsystem (IMS) network, a private network, the Internet, an intranet, and/or another type of suitable network, depending on the requirements of each particular implementation.

One or more components of networked environment 100 may perform one or more of the tasks described as being performed by one or more other components of networked environment 100.

FIG. 2 is an exemplary diagram of a computing device 200 that may be used to implement aspects of certain embodiments of the present application, such as aspects of content server 110 or of receiver 120. Computing device 200 may include a bus 201, one or more processors 205, a main memory 210, a read-only memory (ROM) 215, a storage device 220, one or more input devices 225, one or more output devices 230, and a communication interface 235. Bus 201 may include one or more conductors that permit communication among the components of computing device 200.

Processor 205 may include any type of conventional processor, microprocessor, or processing logic that interprets and executes instructions. Moreover, processor 205 may include processors with multiple cores. Also, processor 205 may be multiple processors. Main memory 210 may include a random-access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 205. ROM 215 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 205. Storage device 220 may include a magnetic and/or optical recording medium and its corresponding drive.

Input device(s) 225 may include one or more conventional mechanisms that permit a user to input information to computing device 200, such as a keyboard, a mouse, a pen, a stylus, handwriting recognition, touchscreen display, voice recognition, biometric mechanisms, and the like. Output device(s) 230 may include one or more conventional mechanisms that output information to the user, including a display, a projector, an A/V receiver, a printer, a speaker, and the like. Communication interface 235 may include any transceiver-like mechanism that enables computing device/server 200 to communicate with other devices and/or systems. For example, communication interface 235 may include mechanisms for communicating with another device or system via a network, such as network 130 as shown in FIG. 1.

As will be described in detail below, computing device 200 may perform operations based on software instructions that may be read into memory 210 from another computer-readable medium, such as data storage device 220, or from another device via communication interface 235. The software instructions contained in memory 210 cause processor 205 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the present application. Thus, various implementations are not limited to any specific combination of hardware circuitry and software.

A web browser comprising a web browser user interface may be used to display information (such as textual and graphical information) on the computing device 200. The web browser may comprise any type of visual display capable of displaying information received via the network 130 shown in FIG. 1, such as Microsoft's Internet Explorer browser, Mozilla's Firefox browser, Apple's Safari browser, Google's Chrome browser or any other commercially available or customized browsing or other application software capable of communicating with network 130. The computing device 200 may also include a browser assistant. The browser assistant may include a plug-in, an applet, a dynamic link library (DLL), or a similar executable object or process. Further, the browser assistant may be a toolbar, software button, or menu that provides an extension to the web browser. Alternatively, the browser assistant may be a part of the web browser, in which case the browser would implement the functionality of the browser assistant.

The browser and/or the browser assistant may act as an intermediary between the user and the computing device 200 and/or the network 130. For example, source data or other information received from devices connected to the network 130 may be output via the browser. Also, both the browser and the browser assistant are capable of performing operations on the received source information prior to outputting the source information. Further, the browser and/or the browser assistant may receive user input and transmit the inputted data to devices connected to network 130.

Similarly, certain embodiments of the present application described herein are discussed in the context of the global data communication network commonly referred to as the Internet. Those skilled in the art will realize that embodiments of the present application may use any other suitable data communication network, including without limitation direct point-to-point data communication systems, dial-up networks, personal or corporate Intranets, proprietary networks, or combinations of any of these with or without connections to the Internet.

FIG. 3 illustrates an embodiment of the corpus/body of data/information to be processed by a hypotheses generation method as set forth in the present application. The corpus of data includes a collection of available data sets that may be related to a group, a time period, a political campaign, an economic interest, a personal preference, a geographic area, a social class, or a past/future event. This corpus of data collects all types of data from the global network, either public or private, including digital and non-digital mediums or sources. As shown in FIG. 3, exemplary types of collected data include, without limitation, emails, metadata, phone records, text messages, account information, social network postings and activities, online comments, purchase orders, blogs, GPS data, files shared with the public or friends, friend lists in social platforms, and news articles. According to an embodiment, the corpus data includes data obtained by scanning newspapers, printed reports, archived files, books, or personal records. The corpus data may also include structured data from transaction logs. This collection of data, in its original form, may or may not be re-organized, and every set of data or every piece of data may be treated as a document.

The value of the data items can be weighted. While all data are of interest, data can have different associated weights depending on their characteristics, including but not limited to their nature, source of capture, volume, uniqueness, and variance. As such, some data are treated as being more valuable than others.

FIG. 4(a) illustrates an embodiment of ontology. According to an embodiment, ontology represents a form of a vector having multiple fields. Depending on a user's interests, each field may be assigned an attribute in a way that the vector represents a conception or an abstraction of a generalized and comprehensive description of human interactions, events, interests, or preferences rather than just a particular event. The attribute value can be generic so as to cover the full set of all possible examples and can be semantic so as to be understandable and interpretable by a machine, such as a computer. Exemplary generic descriptions that may be assigned to the fields include subject, verb, object, adjective, adverb, preposition, location, climate, mood, time, interaction, human interaction, interest, preference, as well as any other generic attributes. According to an embodiment, the ontology has a hierarchical structure, each hierarchy having a form of a vector or a matrix. In an alternative embodiment, the ontology does not support a hierarchical structure.

Each attribute can be weighted differently depending on its significance. That is, while all attributes comprising the ontology are of interest, some attributes, depending on, but not limited to, their level of generality, can have different associated weights. Thus, some attributes are more valuable than others.

An ontology space generated based on the ontology vector as shown in FIG. 4(a) represents an N-dimensional space with each Field(n) representing one dimension. When N is 1, the ontology space has one dimension, which is readily understood by a human being. When N becomes 2 or 3, the ontology space becomes more complicated, but an analyst can still visualize and comprehend it. However, when N is greater than 3, going to 4, 5, or even 100 or more, the ontology space becomes so complex that a human analyst will find it difficult, if not impossible, to intuitively understand the ontology space. Thus, according to one embodiment of the present application, the N-dimensional space is transferred to a lower R-dimensional space, which may be transferred to an even lower S-dimensional space, where S<R<N. According to an embodiment, the N attributes in the ontology vector as shown in FIG. 4(a) may be separated into R groups, where each group represents one dimension, thus reducing the N-dimensional ontology space into an R-dimensional space.
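The grouping of N attributes into R groups can be sketched as follows. This is a minimal illustration only: the field names, the group labels ("actor", "action", "context"), and the function `reduce_dimensions` are hypothetical and are not prescribed by the present application, which leaves the choice of grouping open.

```python
from collections import defaultdict

# Hypothetical assignment of N = 6 ontology fields to R = 3 groups.
FIELD_GROUPS = {
    "subject": "actor",
    "object": "actor",
    "verb": "action",
    "adverb": "action",
    "location": "context",
    "time": "context",
}


def reduce_dimensions(coded_record, field_groups):
    """Collapse an N-field coded record into one tuple-valued coordinate per group."""
    grouped = defaultdict(list)
    for field in sorted(coded_record):  # sorted so the result is deterministic
        grouped[field_groups[field]].append(coded_record[field])
    return {group: tuple(values) for group, values in grouped.items()}


record = {
    "subject": "group", "verb": "bombed", "object": "bunker",
    "adverb": "suddenly", "location": "city", "time": "night",
}
reduced = reduce_dimensions(record, FIELD_GROUPS)
print(len(record), "dimensions reduced to", len(reduced))
```

Each group becomes one axis of the R-dimensional space, and a further grouping of the groups would give the S-dimensional space in the same way.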

According to an embodiment, the ontology vector as shown in FIG. 4(a) is automatically generated by a computer. An analyst may simply input the corpus of information that needs to be analyzed and allow the machine to run the analysis by itself. The computer may create ontology vectors from the corpus of information without any specific instructions from the analyst. The computer may create abstraction or representation frameworks based on the genre of the information. In this way, a truly comprehensive analysis may be applied to the corpus of information without any restriction by targeted interests of an analyst.

FIG. 4(b) illustrates an embodiment of an ontology vector. A vector form having three fields such as (subject, verb, object) is used as a form of ontology to detect all data corresponding to the notion “who did what to whom.” Moreover, such an ontology can be, but need not be, produced using strictly automated means using natural language processing tools such as part-of-speech taggers. This exemplary ontology can generate many sets of hypotheses in a corpus of information, which may or may not be of particular interest. For example, for an analysis of a set of reports on political violence using the ontology as shown in FIG. 4(b), the following hypotheses may be generated:

1. “Terrorists kill people”

2. “AQAP bombs prime minister”

3. “Late model car with known defect explodes while prime minister riding.”

Generally speaking, the 1st hypothesis is likely a true statement, but it is such an apparent and generic statement that it would not likely attract the attention of a human analyst. Thus, the 1st hypothesis is ideally set to a lower priority or rank. The 3rd hypothesis is also potentially true and not an apparent point for an analyst. However, where the 3rd hypothesis is not specifically related to an analyst's inquiry or interest (for example, finding a terrorist threat), its rank would not be high for a human analyst. Among the three hypotheses, an analyst would pay the most attention to the 2nd hypothesis because it is potentially true, not apparent, and related to a relatively particular terror attack. Therefore, a hypotheses generation method can be configured not only to generate all hypotheses according to an ontology vector but also to rank or weight those hypotheses so as to present the relevant ones to a human analyst.

FIG. 5 illustrates an embodiment of a hypotheses generation method 500. At block 502, the system collects and stores all data and information, either digital or non-digital, that could or would have relevant information for a targeted subject of interest, for example terror attacks or extreme weather. The collected data broadly includes any digitized or searchable data, including data from online sources, manually-input data, scanned and OCR'ed data from non-digital media including books, printouts, and magnetic tapes, and structured data from transaction logs. Each set or piece of the collected data may be stored as one document, or a combination of those data may be treated as a document or recorded or stored in another digital format known in the art. At block 504, a user defines one or more forms of ontology as an ontology vector(s) for a target subject of interest. For example, a user may use (subject, verb, object) as a form of ontology. According to an embodiment, the forms of ontology are selected by a computer based on the computer's machine learning experience without any interaction from an analyst or user. It is, however, within the scope of the present application for the ontology to be selected by a user or by a combination of a user and machine learning.

At block 506, the system codes the collected data according to the attributes assigned to the ontology vectors. The coding may be implemented by humans exclusively, by a computer with human supervision, or completely by machines via an entity extraction scheme. According to an embodiment, the coding is done for data in different languages and dialects. According to an embodiment, the coding is implemented by parallel computing in which plural machines code the data independently according to techniques known in the art. During the parallel computing process, the corpus of data/information is first mapped onto a platform of multiple machines and then is coded accordingly.

After the data are coded, at block 508 the system is configured to create an ontology space. The ontology space includes all realizations of the ontology, which are assembled into an ordered multidimensional object such as a two-dimensional object. The complete collection of different ontological combinations is referred to as the ontology space. For example, a coding of data may show 100 choices for each field of the ontology vector (subject, verb, object). Then, the ontology space in these data includes 100³=1 million distinct events. At block 508 the system also populates the ontology space, in which data are classified according to the ontology of the targeted subject of interest. For example, events in documents contained in the corpus of the subject of interest are assigned to corresponding points in the ontology space. According to an embodiment, at block 508 the system is configured to support weighing or biasing certain events. When only a small number of neighborhoods of the total space are populated, the system can handle such sparse data without difficulty.
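The size of the ontology space and its sparse population can be illustrated with a short sketch. The function names and the sample events below are hypothetical; the point is only that the full space (100 x 100 x 100 points) never has to be materialized, because only the points that actually receive events are stored.

```python
from collections import Counter
from math import prod


def ontology_space_size(choices_per_field):
    """Number of distinct points: the product of the choices available per field."""
    return prod(choices_per_field)


def populate(events):
    """Sparse ontology space: only points that receive at least one event are stored."""
    return Counter(events)


size = ontology_space_size([100, 100, 100])  # (subject, verb, object)

# Hypothetical coded events, one (S, V, O) triple per event found in a document.
events = [
    ("group", "bombed", "bunker"),
    ("group", "bombed", "bunker"),
    ("AQAP", "bombs", "prime minister"),
]
space = populate(events)
print(size, "possible points,", len(space), "populated")
```

Weighing or biasing certain events could be added by accumulating a per-event weight instead of a count of one; that extension is an assumption and is not shown here.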

The completeness of the ontology space depends on the expansiveness of the fields of the selected ontology vector. If an attribute is conceptualized at a high level, it is likely to create a more complete ontology space than a more specific one. For example, an attribute of “climate” could create more hypotheses than an attribute of “temperature.” According to an embodiment, the degree of completeness of the ontology is evaluated by comparing results of different ontology selections because the degree or extent to which the ontology is complete depends on the nature of the ontology, i.e., what it was developed to do or the maturity of the work. In the exemplary ontology vector of (subject, verb, object), the set of all distinct (S, V, O) combinations is the set of distinct hypotheses, which explain events regarding human interactions contained in a corpus. The completeness of this (S, V, O) ontology depends on the number of choices for each triple element, whether the (subject, verb, object) construct is sufficient to describe events of interest, and whether indirect objects need to be captured.

At block 510 the system is configured to create a hypothesis space by transforming the ontology space created in step 508. Step 510 groups and merges similar and related concepts in the ontology space, transforming the ontology space into an ordered hypothesis space. When the specific values coded out of the data have many choices for one field of ontology, many hypotheses may be very similar. For example, in the (S, V, O) ontology, two hypotheses (group, bombed, bunker) and (group, exploded, bunker) are not distinct events based on security interests. According to an embodiment, the merging process may implement clustering techniques including hierarchies, filters/thresholds, topic models, and conditional random fields as known in the art. According to an embodiment, the hypothesis space represents hypotheses that are grouped by relatedness of concepts, in which grouping/merging related concepts in the neighborhood of one another results in a space where position relates to clusters of similar hypotheses. As a result of the grouping/merging process, the hypothesis space can be intuitively perceived by a human analyst. When plural documents are mapped into hypotheses in a particular neighborhood, then a human analyst viewing this clustering could hypothesize that those types of events might have occurred.
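The merging of near-duplicate hypotheses can be sketched as follows. This is a simplified illustration that substitutes a hand-written synonym table for the clustering techniques named above (hierarchies, filters/thresholds, topic models, conditional random fields); the table `CANONICAL_VERB`, the function `merge_hypotheses`, and the counts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical synonym table standing in for a real clustering technique.
CANONICAL_VERB = {"bombed": "attack", "exploded": "attack", "visited": "visit"}


def merge_hypotheses(counts, canonical_verb):
    """Merge (S, V, O) points whose verbs map to the same canonical concept."""
    merged = defaultdict(int)
    for (subject, verb, obj), count in counts.items():
        key = (subject, canonical_verb.get(verb, verb), obj)
        merged[key] += count
    return dict(merged)


# Populated ontology space: (S, V, O) point -> number of supporting documents.
ontology_space = {
    ("group", "bombed", "bunker"): 3,
    ("group", "exploded", "bunker"): 2,
    ("envoy", "visited", "bunker"): 1,
}
hypothesis_space = merge_hypotheses(ontology_space, CANONICAL_VERB)
print(hypothesis_space)
```

After merging, the two bunker events occupy a single point with their document counts combined, which is what makes a neighborhood of related events visible as one cluster.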

The hypothesis space can be organized based on personalized criteria. Depending on an individual's or group's identity or role, the likelihood of novelty and interest of a hypothesis can be estimated. Thus, the ranking of the derived hypotheses can be adjusted to account for these estimates.

At block 512, the system is configured to select relevancy criteria to weigh all the hypotheses. The relevancy criteria may be a weighing schema that, when applied to the hypotheses, defines a surface in the hypothesis space. The resulting surface has troughs, each of which corresponds to a hypothesis neighborhood. The depth of the troughs is determined by the weighing schema applied and is interpreted as being related to the likelihood of the neighborhood being a relevant set of hypotheses, i.e., the more relevant the neighborhood, the deeper the trough. According to an embodiment, the system can be configured to employ a weighing schema, for instance by employing a weighting algorithm or module that weighs based on, for example, the frequency of a word or words, parts of speech, thresholding of concepts, and/or exclusions (e.g., excluding proper names or locations). By ranking the relative depths of the resulting N troughs in the hypothesis space, the method can identify a ranked list of n relevant hypotheses, where n is less than or equal to N, to present to a human analyst for testing. For example, the method may identify the deepest trough, and then the next deepest, and so on.
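Selecting the n deepest of N troughs can be sketched as a simple sort. The function name `rank_troughs` and the depth values below are hypothetical and chosen only for illustration; the hypothesis numbers are placeholders.

```python
def rank_troughs(depths, n):
    """Return the identifiers of the n deepest troughs, deepest first.

    depths maps a hypothesis identifier to its trough depth, where a larger
    depth is taken to mean a more relevant hypothesis neighborhood.
    """
    if n > len(depths):
        raise ValueError("n must be less than or equal to the number of troughs")
    return sorted(depths, key=lambda hyp: depths[hyp], reverse=True)[:n]


# Hypothetical depths for N = 5 troughs (values are assumptions).
depths = {91: 5.0, 63: 3.0, 122: 2.0, 66: 1.0, 7: 0.5}
print(rank_troughs(depths, 3))
```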

At block 514, the system is configured to apply an optimization algorithm to find the global and/or local minimum or minima of the hypothesis surface. According to an embodiment, the optimization algorithm may be, in addition to simulated annealing, a Monte Carlo based or genetic algorithm based approach, among others, as known in the art. According to an embodiment, at block 514, the system is configured to employ a simulated annealing process to find the global and ranked local minima. The simulated annealing process builds an ensemble of simulated annealing runs, each of which corresponds to a random initial point in the hypothesis surface. This simulated annealing process is preferably implemented using parallel computing techniques. The resulting accounting of the N most frequently occupied wells corresponds to the ranked list of hypotheses potentially explaining the material in the corpus.

According to an embodiment, the simulated annealing process is configured to model a physical process of heating a solid material and then slowly lowering the temperature. The physical process decreases defects in the material and thus minimizes the system energy. In this application, each iteration of the simulated annealing algorithm entails picking a new random point on the surface of interest. The distance of the new point from the current point, or the extent of a hop along the corrugated surface, is based on a probability distribution function that depends upon “temperature.” The hop is increased from a small distance to a longer one, similarly to the change of temperature in the corresponding physical process. The algorithm accepts all new points that lower the energy, but also, with a finite probability, points that raise the energy. By accepting some points that raise the energy, the algorithm avoids being trapped in local minima in early iterations and is able to explore globally for better solutions by hopping into potentially lower troughs on the surface that can only be accessed after traversing higher features on the surface.
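A single annealing run over a one-dimensional hypothesis surface can be sketched as below. This is a minimal sketch under stated assumptions, not the disclosed implementation: the energy of a hypothesis is taken to be the negative of its weight, the Metropolis rule `exp(-delta / t)` is used as the finite probability of accepting an uphill move, the hop extent is made proportional to the current temperature during the cooling phase, and the geometric cooling schedule and all parameter values are illustrative.

```python
import math
import random


def simulated_annealing(energy, start, rng, t_start=1.0, t_end=0.01, cooling=0.95):
    """Run one cooling cycle and return the lowest-energy index visited."""
    n = len(energy)
    current = start
    best = start
    t = t_start
    while t > t_end:
        # Hop extent depends on temperature (assumed proportional here).
        max_hop = max(1, round(t * n / 2))
        candidate = (current + rng.randint(-max_hop, max_hop)) % n
        delta = energy[candidate] - energy[current]
        # Always accept downhill moves; accept uphill moves with a
        # temperature-dependent probability (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
            if energy[current] < energy[best]:
                best = current
        t *= cooling  # slowly lower the temperature
    return best


# Hypothetical surface: energy = -weight, so deep troughs have low energy.
energy = [0.0, -1.0, 0.0, -3.0, 0.0, -2.0, 0.0, -0.5]
best = simulated_annealing(energy, start=0, rng=random.Random(0))
print(best, energy[best])
```

Because uphill moves are sometimes accepted, the walker can cross the ridges between troughs instead of settling in the first one it reaches.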

Random variations or mutations, in the annealing and genetic processes respectively, can be used to prevent the incorrect determination of a desired solution, namely a hypothesis of limited value, due to local minima effects. A local minimum provides a better solution than its neighboring solutions, but settling on it means that better available solutions are missed.

In one embodiment, mutations are guided. At each proposed mutation, the neighborhood can be assessed for fitness. In an annealing process, for example, fitness can be assessed by the rate of change, as exemplified without limitation by the slope of descent or ascent. In a genetic process, the fitness of a population member can be computed. Regardless of which process is used, a mutation can be rejected if the mutation results in a hypothesis space that is deemed highly anticipated. Additionally, the rate of mutation can be modified to be a function of the anticipation level of the neighborhood in which the search is initially located (e.g., a nonlinear mapping, a simple proportional dependence, etc.). Still further, the level of anticipation can be based on the profile of the analyst receiving the hypotheses.

Consider the space of all possible hypotheses populated by the machine-coded documents classified according to the ontology. Specifically, in this space there exists a set of clusters defined by vectors pointing from the origin to the different hypotheses implied by the corpus. Consider a distortion of the space in such a way that trivial or un-interesting hypotheses occupy one or more specified regions of the space. Here, “trivial” and “un-interesting” connote hypotheses that a user expects from the data, without the aid of the disclosed embodiments. Given a user profile either entered by the user or determined automatically based on, but not limited to, previous hypotheses considered, or the user's individual or group identity or role, characteristics of “interesting” hypotheses can be determined using topic models or other information retrieval approaches known in the art. The un-interesting clusters can be masked or deleted (i.e., completely de-weighted), thus directing the search to potentially interesting hypotheses and away from the un-interesting clusters. Hypotheses identified in the resulting constrained search are, by definition, interesting since un-interesting hypotheses are removed via the optimization algorithm.

By utilizing details of the trajectory of the search, or of the structure of the space itself, interesting hypotheses are identified and discriminated from unintelligible, meaningless hypotheses.

Consider a “hypothesis neighborhood”; that is, the neighborhood in hypothesis space surrounding a given point (i.e., a given hypothesis). Given the previously obtained or determined interests of the user, attribute weights can be established using any of the known information retrieval techniques, such as but not limited to uniqueness, for assessing the interesting hypothesis neighborhoods. Thus, each point in the trajectory of a search can be evaluated to see if that neighborhood has characteristics of an interesting hypothesis. The neighborhood surrounding each point in the search can be summarized to see if the neighborhood possesses the attributes of something near an interesting hypothesis, thereby discriminating between interesting and non-interesting hypotheses.

Consider the following illustrative example. For any simulated annealing or similar searching algorithm, at each step in the search, determine if the neighborhood indicates anticipated or trivial hypotheses. If so, then that step is skipped, and the next cycle would effectively direct the search in a different direction, one more likely to produce unanticipated, non-trivial hypotheses.
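The skipping of steps that land in anticipated neighborhoods can be sketched as follows. For brevity this sketch uses a plain downhill walk rather than full simulated annealing, which is a simplification and an assumption; the names `guided_search`, `anticipated`, and `accept_prob` are hypothetical, and the set of anticipated hypotheses would in practice come from a user profile.

```python
import random


def guided_search(energy, anticipated, start, steps, rng, accept_prob=0.0):
    """Downhill walk that skips steps into anticipated (trivial) neighborhoods.

    accept_prob is the low (here zero by default) probability with which a
    step into an anticipated neighborhood is still considered.
    """
    n = len(energy)
    current = start
    trajectory = [current]
    for _ in range(steps):
        candidate = (current + rng.choice([-2, -1, 1, 2])) % n
        # Skip the step if it lands on an anticipated hypothesis; the next
        # cycle then proposes a hop in a different direction.
        if candidate in anticipated and rng.random() >= accept_prob:
            continue
        if energy[candidate] <= energy[current]:
            current = candidate
            trajectory.append(current)
    return current, trajectory


# Hypothetical surface; index 2 is the deepest trough but is marked as anticipated.
energy = [0.0, -1.0, -4.0, -1.0, 0.0, -2.0, -3.0, -1.0]
anticipated = {2}
final, trajectory = guided_search(energy, anticipated, start=0, steps=50,
                                  rng=random.Random(1))
print(final, trajectory)
```

With `accept_prob` left at zero, the anticipated trough is never entered, so the search can only settle in one of the remaining neighborhoods.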

At block 516, the system is configured to present the selected hypotheses that are relevant to a particular interest or event to a human analyst. According to an embodiment, the system can present specific hypotheses in a textual format to an analyst. According to an embodiment, the system can present a representation of a hypothesis surface to the analyst. According to an embodiment, the system can present a color map representation of the hypotheses to the analyst. On the color map, an identification number of a hypothesis can be associated with a color whose brightness indicates the ranking of a hypothesis or the relevancy of a neighborhood.

FIGS. 6-15 illustrate an embodiment of a hypotheses generation method applied to monitoring natural disasters as a target subject of interest. For an ontology defined by where a disaster hits, what the disaster is, and how it produces damage that is of interest, an ontology vector of {where, what, how} is selected as the ontology. A system collects and/or data for news reports on disasters caused by storms in a few metropolitan areas. Table I (FIG. 18) depicts part of a hypothetical corpus of information that could be collected. Table II includes an exemplary computer program used for implementing the method according to an embodiment of the present disclosure. The program in Table II is written in R. According to an embodiment, the program specifies a hypothetical ontology and generates a hypothetical corpus, hypothetical weights, and a graphical representation of the corresponding weighted hypotheses. Other graphical representations, including heat maps and dendrograms, may also be used. Non-limiting examples of software packages and interpreted languages which can readily implement simulated annealing include R, Octave, Python, Ruby, Scilab, Matlab, Mathematica, and other similar programs as known in the art.

After coding the corpus of data as described herein, possibilities for each of the three elements are detected, shown as:

Where: Pittsburgh, Carmichaels, New York, Cincinnati, San Francisco

What: tornado, hurricane, tsunami, storms, earthquake

How: wind, rain, flooding, lightning, shaking

Such an ontology can produce 125 potential hypotheses, the first 25 of which are shown in FIG. 6. The collection of distinct combinations defining these hypotheses can be represented in a one-dimensional column of ontological triples. Each hypothesis is assigned an identifier, such as a hypothesis number as shown in FIG. 6.
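The enumeration of the 5 × 5 × 5 = 125 triples can be sketched in Python as follows. The function name `build_ontology_space` is hypothetical. The numbering order (the "where" field varying fastest) is an assumption; it was chosen because it reproduces the hypothesis numbers quoted in this example (#91, #63, #122, and #66).

```python
from itertools import product

WHERE = ["Pittsburgh", "Carmichaels", "New York", "Cincinnati", "San Francisco"]
WHAT = ["tornado", "hurricane", "tsunami", "storms", "earthquake"]
HOW = ["wind", "rain", "flooding", "lightning", "shaking"]


def build_ontology_space(where, what, how):
    """Return {hypothesis_number: (where, what, how)}, numbered from 1."""
    space = {}
    # product() varies its last argument fastest, so the fields are passed
    # in reverse order to make "where" the fastest-varying element.
    combos = product(how, what, where)
    for number, (how_value, what_value, where_value) in enumerate(combos, start=1):
        space[number] = (where_value, what_value, how_value)
    return space


ontology_space = build_ontology_space(WHERE, WHAT, HOW)
print(len(ontology_space))
print(ontology_space[91])
```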

In another example, when news articles from the Internet are collected and their headlines are processed, not only are the elements of interest corresponding to one ontology coded, but other elements of potential interest may also be coded, as shown in FIG. 7. The ontology space of the headlines of those articles is shown in FIG. 8. According to an embodiment, the ontology codes may be supplemented by other fields an analyst may be interested in. For example, the analyst may also want to know whether the disaster areas belong to urban or rural areas, coded as a “Type of place” field for the ontology vector, as shown in FIG. 9.

The system can be configured to apply one or more weighing criteria. For example, the system can be configured to apply a relatively simple relevancy criterion or criteria, for example, the frequency of occurrences of the different hypotheses in the corpus. When such a criterion is applied, the weights are assigned to corresponding hypotheses. FIG. 10 shows the weights for the first 24 hypotheses.
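A frequency-of-occurrence weighting can be sketched as below. The coded corpus shown is hypothetical and is not the data of Table I or FIG. 10; normalizing the counts so that the weights sum to one is also an assumption of this sketch.

```python
from collections import Counter


def frequency_weights(coded_corpus):
    """Weight each hypothesis by its relative frequency in the coded corpus."""
    counts = Counter(coded_corpus)
    total = sum(counts.values())
    return {hypothesis: count / total for hypothesis, count in counts.items()}


# Hypothetical coded documents, one (where, what, how) triple per document.
coded_corpus = [
    ("Pittsburgh", "storms", "lightning"),
    ("Pittsburgh", "storms", "lightning"),
    ("Pittsburgh", "storms", "flooding"),
    ("New York", "tsunami", "flooding"),
]
weights = frequency_weights(coded_corpus)
for hypothesis, weight in weights.items():
    print(hypothesis, weight)
```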

According to an embodiment, the hypothesis space for the entire set of hypotheses may be represented as a color map, with the brightest color corresponding to the most heavily weighted hypotheses and the darkest color corresponding to the least weighted hypotheses, as shown in FIG. 11. In general, the hypothesis space will be m-dimensional, or a projection of the higher-dimensional space (e.g., via PCA or similar) into a simpler or lower-dimensional space representation.

According to an embodiment, the weighted hypotheses form a hypothesis surface as shown in FIG. 12. This hypothesis surface corresponds to a surface with peaks and troughs, where the troughs represent the most highly weighted hypotheses. To apply simulated annealing to find the global minimum of this hypothesis surface, multiple “heating-cooling” cycles may be applied.

As shown in FIG. 13, the simulated annealing process can be thought of as picking a random hypothesis and placing a ball at that location. By heating the system, the process applies energy to the ball and it hops from hypothesis to hypothesis, landing in some trough (cycle 1). FIG. 14 shows that cycles can be rejected as being not relevant or uninteresting. In a given heating-cooling cycle, the step shown in FIG. 13 may not lead to a neighborhood of interesting hypotheses. Accordingly, the step is rejected. Additional cycles as shown in FIG. 15 allow the ball to hop and land in deeper troughs (cycle 2), until it does not have enough energy to escape (cycle L, in this example). This trough is identified as a candidate for the most likely hypothesis. According to an embodiment, plural simulated annealing cycles are made to build up a ranked list of relevant potential hypotheses (FIG. 16). The following is a list of relevant hypotheses selected by this simulated annealing process:

Hypothesis #91, “Pittsburgh storms lightning”

Hypothesis #63, “New York tsunami flooding”

Hypothesis #122, “Carmichaels earthquake shaking”

Hypothesis #66, “Pittsburgh storms flooding”

In this example, the conclusion would be that storms had affected Pittsburgh (#91).

FIG. 17 shows a flowchart of a method for generating hypotheses. The goal is to identify hypotheses to explain observed data, which can be included in a plurality of documents. The hypotheses can be a ranked set, and, as discussed further below, the ranked set includes hypotheses considered interesting, while other hypotheses are disregarded. In process block 1710, the ontology space can be populated or otherwise constructed. For example, all realizations of an ontology can be computed and assembled into a multi-dimensional object. The different ontological combinations represent different combinations of identified elements, such as subject, verb, and object elements. The data can be classified according to the ontology of interest. For example, events in documents contained in the corpus of interest can be assigned to the corresponding points in the ontology space. Weighting can then be applied to bias certain supported events.

In process block 1720, the ontology space is transformed into a hypothesis space. Related concepts are grouped and merged to transform the ontology space into an ordered hypothesis space. Approaches for merging include clustering techniques, filters/thresholds, topic models, conditional random fields, etc. The grouping of related concepts in the neighborhood of one another results in a space where position relates to clusters of similar hypotheses. Filtration according to user relevance in a user profile can also be used. Filtration can be performed based on an interest of a user. For example, a user profile can be stored and used for the filtration. The user profile can be generated automatically based on previous hypotheses considered, or the user's individual or group identity or role. Other techniques can be used for determining user interest. Filtering reduces the overall hypothesis space, which potentially increases the speed of processing due to less data being processed.

In process block 1730, the relevancy criteria can be set by applying a weighting schema to define a surface in the hypothesis space. In a simulated annealing representation, the resulting surface has troughs, the depths of which correspond to hypothesis neighborhoods. In a genetic algorithm representation, the hypothesis space can be represented in terms of a population, with a fitness function used as a weighting function. Possible weighting functions can include one or more of the following: simple word frequency, parts of speech, thresholding, and exclusion of a set of notions not of interest (e.g., proper names or locations). Other weighting functions can also be used. Another weighting scheme ensures that non-trivial and interesting hypotheses are found. Masking or deleting troughs in the simulated annealing context corresponds to trivial and un-interesting neighborhoods being de-weighted. Additionally, deleting a member of the population with a low fitness score in genetic algorithms achieves the same devaluation. The resulting search omits trivial and un-interesting clusters, which speeds the overall analysis. By ranking the relative depths of the resulting troughs in the simulated annealing context and population member fitness in the genetic algorithm context, the ranked list of N (where N is any integer number) relevant hypotheses can be identified. Identifying the deepest trough and then the next deepest trough, etc. is an optimization problem which is known in the art. The ranking can be applied to both trivial/un-interesting and non-trivial/interesting clusters.

In process block 1740, an optimization problem is solved. In one example, simulated annealing can be used to find the global and ranked local minima of the hypothesis surface. An ensemble of simulated annealing runs can be built, each run corresponding to a random initial point in the hypothesis surface. The resulting accounting of the N most frequently occupied wells corresponds to the ranked list of hypotheses potentially explaining the material in the corpus. Representative optimization approaches include simulated annealing, genetic algorithms, Monte Carlo, etc. In simulated annealing, the distance of a new point from the current point, or the extent of a hop along a corrugated surface, is based on a probability distribution function that depends on temperature.
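The ensemble accounting can be sketched as follows: many independent annealing runs are started from random initial points, and the wells in which the runs end are tallied. The names `anneal_once` and `rank_wells`, the one-dimensional surface, the cooling schedule, and the Metropolis acceptance rule are all assumptions of this sketch. The runs are executed sequentially here; since they are independent, they could be distributed across parallel workers.

```python
import math
import random
from collections import Counter


def anneal_once(energy, rng, t_start=1.0, t_end=0.01, cooling=0.9):
    """One annealing run from a random start; returns the index where it ends."""
    n = len(energy)
    current = rng.randrange(n)  # random initial point on the surface
    t = t_start
    while t > t_end:
        max_hop = max(1, round(t * n / 2))  # hop extent depends on temperature
        candidate = (current + rng.randint(-max_hop, max_hop)) % n
        delta = energy[candidate] - energy[current]
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
        t *= cooling
    return current


def rank_wells(energy, runs, top_n, seed=0):
    """Tally final positions over an ensemble and return the top_n wells."""
    rng = random.Random(seed)
    occupancy = Counter(anneal_once(energy, rng) for _ in range(runs))
    return occupancy.most_common(top_n)


# Hypothetical surface: energy = -weight, so deep troughs have low energy.
energy = [0.0, -1.0, 0.0, -3.0, 0.0, -2.0, 0.0, -0.5]
ranking = rank_wells(energy, runs=200, top_n=3)
for index, count in ranking:
    print(index, count)
```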

At each step in a search trajectory, the neighborhood surrounding that point can be summarized to see if it possesses the attributes of something near an interesting hypothesis. This allows the algorithm to discriminate between interesting and non-interesting hypotheses. If the point is near an interesting hypothesis, then the step is accepted. If not, then the step can be accepted with a low or zero probability.

The overall method has several advantages, including, but not limited to: (1) by masking or deleting hypotheses, the overall processing time is potentially reduced; (2) the method can be performed in real time or near real time, and when a simulated annealing approach is used, the hypotheses generation is a highly parallel computation that can be distributed for computational efficiency; and (3) the method can be performed in any domain for which one or more ontologies are known or can be discovered.

Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed methods can be used in conjunction with other methods.

Any of the disclosed methods can be implemented as computer-executable instructions or a computer program product stored on one or more computer-readable storage media and executed on a computing device (e.g., any available computing device, including smart phones or other mobile devices that include computing hardware). Computer-readable storage media are any available tangible media that can be accessed within a computing environment (e.g., one or more optical media discs such as DVD or CD, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as flash memory or hard drives)).

Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable storage media. The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.

For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known and need not be set forth in detail in this disclosure.

Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.

The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and sub-combinations with one another. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.

The technologies from any example can be combined with the technologies described in any one or more of the other examples. In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are examples of the disclosed technology and should not be taken as a limitation on the scope of the disclosed technology.

The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the invention. Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

1-20. (canceled)
21. A method of identifying hypotheses in a corpus of data, the method comprising: receiving an ontology by one or more computers, the ontology including a plurality of fields and a plurality of choices for each of the fields such that the ontology includes a plurality of ontology vectors that each include one choice for each of the fields of the ontology, the ontology vectors being organized as a multi-dimensional space wherein: each dimension of the multi-dimensional space represents one or more fields of the ontology; and ontology vectors representing similar and/or related concepts are closer together in the multi-dimensional space than ontology vectors representing dissimilar and/or unrelated concepts; receiving the corpus of data by the one or more computers; identifying ontology vectors in the corpus of data, by the one or more computers, by detecting data in the corpus of data that corresponds to ontology vectors in the ontology; grouping the identified ontology vectors that describe similar and/or related concepts into groups, by the one or more computers, each group of similar and/or related ontology vectors representing a hypothesis; weighing each of the hypotheses by the one or more computers; and applying an optimization algorithm, by the one or more computers, to rank the hypotheses in accordance with the weight of each hypothesis.
22. The method of claim 21, wherein the optimization algorithm comprises one of a simulated annealing algorithm, a Monte Carlo-based algorithm, or a genetic algorithm.
23. The method of claim 21, wherein the optimization algorithm ranks the hypotheses in accordance with the weight of each hypothesis by: ranking the deepest troughs of a multi-dimensional surface having troughs that each represent a group of similar and/or related ontology vectors and each have a depth proportional to the weight of the group of ontology vectors; or ranking the fittest population members of a population having population members that each represent a group of similar and/or related ontology vectors and each have a fitness proportional to the weight of the group of ontology vectors.
24. The method of claim 21, wherein the weight of each hypothesis is based on frequency of one or more words, parts of speech, thresholding of concepts, or exclusions.
25. The method of claim 21, wherein: the corpus of data includes a plurality of documents; the method further comprises weighting one or more of the documents; and the weight of each hypothesis is based at least in part on the weight of one or more documents having data corresponding to the ontology vector representing each hypothesis.
26. The method of claim 24, wherein the weight of each document is based on a source of capture of the document, volume of the document, uniqueness of the document, or variance of the document.
27. The method of claim 21, wherein: the form of the ontology includes a weight for each of the fields; and the weight of each hypothesis is based at least in part on the weight of the fields of the ontology vector representing each hypothesis.
28. The method of claim 21, wherein: the ontology includes N fields; and the multi-dimensional space includes N dimensions, each of the N dimensions representing one of the N fields of the ontology.
29. The method of claim 21, wherein: the ontology includes N fields; grouping the identified ontology vectors comprises separating the N fields of the ontology into R groups; and the multi-dimensional space includes R dimensions, each of the R dimensions representing one of the R groups.
30. The method of claim 21, wherein the identified ontology vectors that describe similar and/or related concepts are grouped using one or more clustering techniques.
31. The method of claim 30, wherein the one or more clustering techniques include hierarchies, filters and thresholds, topic models, or conditional random fields.
32. The method of claim 21, wherein the optimization algorithm de-weights trivial or uninteresting hypotheses by: introducing a random variation or mutation into data representing the groups of similar and/or related ontology vectors; and determining an anticipation level of each group of ontology vectors.
33. The method of claim 32, wherein: the optimization algorithm comprises a simulated annealing algorithm; and the anticipation level of each group of ontology vectors is determined based on a slope of descent or ascent of a local minimum representing the group of ontology vectors.
34. The method of claim 32, wherein: the optimization algorithm comprises a genetic algorithm; and the anticipation level of each group of ontology vectors is determined based on a fitness level of a population member representing the group of ontology vectors.
35. The method of claim 21, wherein weighting and ranking the hypotheses comprises: storing personalized criteria of a user; and filtering the hypotheses to de-weight hypotheses that are trivial or uninteresting to the user.
36. The method of claim 35, wherein the personalized criteria is determined based on hypotheses previously considered by the user.
37. The method of claim 35, further comprising: storing the identity or role of the user, wherein the personalized criteria is determined based on the identity or role of the user.
38. The method of claim 21, wherein the hypotheses are ranked based on the path through the multi-dimensional space by which the group of ontology vectors representing each hypothesis was discovered by the optimization algorithm.
39. The method of claim 21, wherein the hypotheses are ranked in a stateless manner based on the positions of the groups of ontology vectors representing each hypothesis in the multi-dimensional space.
40. The method of claim 21, further comprising: outputting at least some of the ranked hypotheses for display to a user.
41. A system for identifying hypotheses in a corpus of data, the system comprising: non-transitory computer readable storage media that stores the corpus of data; a content server that: receives an ontology by one or more computers, the ontology including a plurality of fields and a plurality of choices for each of the fields such that the ontology includes a plurality of ontology vectors that each include one choice for each of the fields of the ontology, the ontology vectors being organized as a multi-dimensional space wherein: each dimension of the multi-dimensional space represents one or more fields of the ontology; and ontology vectors representing similar and/or related concepts are closer together in the multi-dimensional space than ontology vectors representing dissimilar and/or unrelated concepts; identifies ontology vectors in the corpus of data; groups the identified ontology vectors that describe similar and/or related concepts into groups, each group of similar and/or related ontology vectors representing a hypothesis; weighs each of the hypotheses; and uses an optimization algorithm to rank the hypotheses in accordance with the weight of each hypothesis.